library(tidyverse)
library(dendextend)
library(hopkins)
library(cluster)
library(gridExtra)
library(Rtsne)
library(fpc)
library(factoextra)
library(flexclust)
library(ggplot2)
library(dbscan)
library(clusterSim)
The Volleyball Nations League (VNL), organized by the FIVB, is an annual international volleyball competition contested by senior national teams. It is one of the biggest competitions in world volleyball and draws the attention of volleyball connoisseurs every year. This paper aims to cluster the men's volleyball players into groups based on their 2024 performance statistics for blocking, attacking, receiving, and other skills. By analyzing key performance metrics, we hope to uncover meaningful patterns and group players with similar playing styles and strengths. The results may also provide a reference for team strategies and tactical formulation, help teams discover trends that keep them competitive, and support the identification and analysis of competitors.
set.seed(123)
row_names <- rownames(data)
data_z <- as.data.frame(lapply(data, scale)) # Z-score standardization
rownames(data_z) <- row_names
hopkins(data_z)
## [1] 1
n_range <- 2:10
hopkins_values <- numeric(length(n_range))
for (i in seq_along(n_range)) { # Hopkins statistic for different sample sizes
  n <- n_range[i]
  # Note: the second argument of get_clust_tendency is the number of points
  # sampled from the data, not a number of clusters
  result <- get_clust_tendency(data_z, n, graph = FALSE)
  hopkins_values[i] <- result$hopkins_stat
}
plot(n_range, hopkins_values, type = "b", pch = 16, col = "blue",
     xlab = "Number of Sampled Points (n)", ylab = "Hopkins Statistic",
     main = "Hopkins Statistic for Different Sample Sizes")
The Hopkins statistic is close to 1, which indicates a high degree of clustering tendency: the dataset is highly likely to contain meaningful, well-defined clusters. Note that the second argument of get_clust_tendency is the number of points sampled, not a number of clusters, so the plot shows the stability of the statistic across sample sizes rather than an optimal k; the values remain well above 0.5 throughout, which reinforces the conclusion. The choice of k itself is explored below with hierarchical clustering, K-means clustering, and PAM clustering.
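To make the statistic concrete, here is a minimal base-R sketch of how a Hopkins-type statistic can be computed (an illustrative simplification, not the exact algorithm of the hopkins package): nearest-neighbour distances from uniform random probes to the data are compared with nearest-neighbour distances within the data itself.

```r
# Sketch of the Hopkins statistic: H = sum(u^d) / (sum(u^d) + sum(w^d)),
# where u are NN distances from uniform probes to the data and
# w are NN distances from sampled data points to the rest of the data.
hopkins_sketch <- function(X, m = nrow(X) %/% 10) {
  X <- as.matrix(X); n <- nrow(X); d <- ncol(X)
  idx <- sample(n, m)
  # m uniform probe points inside the bounding box of the data
  U <- sapply(seq_len(d), function(j) runif(m, min(X[, j]), max(X[, j])))
  u <- apply(U, 1, function(p)
    min(sqrt(rowSums((X - matrix(p, n, d, byrow = TRUE))^2))))
  # NN distance of each sampled data point to the remaining data points
  w <- sapply(idx, function(i)
    min(sqrt(rowSums((X[-i, , drop = FALSE] -
                      matrix(X[i, ], n - 1, d, byrow = TRUE))^2))))
  sum(u^d) / (sum(u^d) + sum(w^d))  # near 0.5: random; near 1: clustered
}

set.seed(123)
clustered <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 8), ncol = 2))
uniform   <- matrix(runif(200), ncol = 2)
hopkins_sketch(clustered)  # high: strong clustering tendency
hopkins_sketch(uniform)    # around 0.5: no tendency
```

Values near 0.5 indicate no clustering tendency and values near 1 indicate strong tendency, which matches the interpretation above.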
method <- c(average = "average", single = "single", complete = "complete", ward = "ward") # named so map_dbl labels the output
agglomerative_coefficient <- function(x) {
agnes(data_z, method = x)$ac
}
map_dbl(method, agglomerative_coefficient)
## average single complete ward
## 0.8871528 0.7777406 0.9070055 0.9747774
After computing the agglomerative coefficients for the different linkage methods, we can see that complete and Ward linkage should be used for hierarchical clustering, because their values are closest to 1, which suggests a strong clustering structure. Additionally, the high divisive coefficient calculated below also indicates a good separation of clusters.
hc_elbow_complete <- fviz_nbclust(data_z, FUNcluster = hcut, method = "wss", hc_method = "complete")
hc_sil_complete <- fviz_nbclust(data_z, FUNcluster = hcut, method = "silhouette", hc_method = "complete")
hc_gap_complete <- fviz_gap_stat(clusGap(data_z, FUNcluster = hcut, K.max = 10, B = 50, hc_method = "complete"))
hc_elbow_ward <- fviz_nbclust(data_z, FUNcluster = hcut, method = "wss", hc_method = "ward.D")
hc_sil_ward <- fviz_nbclust(data_z, FUNcluster = hcut, method = "silhouette", hc_method = "ward.D")
hc_gap_ward <- fviz_gap_stat(clusGap(data_z, FUNcluster = hcut, K.max = 10, B = 50, hc_method = "ward.D"))
grid.arrange(hc_elbow_complete, hc_elbow_ward, hc_sil_complete, hc_sil_ward, hc_gap_complete, hc_gap_ward,
             ncol = 2, top = "Comparison of Elbow, Silhouette, Gap: Complete (Left) vs. Ward (Right)")
Based on the plots, the optimal number of clusters from hierarchical clustering is suggested to be 4 (elbow and silhouette) or 9 (gap statistic). We will therefore use 4 as the number of clusters.
hc_ward <- agnes(data_z, method = "ward")
hc_ward_group <- cutree(as.hclust(hc_ward), k = 4) # Cut tree into 4 groups (agnes objects need as.hclust first)
pltree(hc_ward, cex = 0.6, hang = -1, main = "Dendrogram - Agnes - Ward")
rect.hclust(as.hclust(hc_ward), k = 4, border = 2:5)
hc_complete <- agnes(data_z, method = "complete")
pltree(hc_complete, cex = 0.6, hang = -1, main = "Dendrogram - Agnes - Complete")
rect.hclust(as.hclust(hc_complete), k = 4, border = 2:5)
hc_diana <- diana(data_z)
hc_diana$dc # Divisive coefficient
## [1] 0.9053562
hc_diana_group <- cutree(as.hclust(hc_diana), k = 4) # Cut tree into 4 groups
pltree(hc_diana, cex = 0.6, hang = -1, main = "Dendrogram - Diana")
rect.hclust(as.hclust(hc_diana), k = 4, border = 2:5)
tanglegram(as.dendrogram(hc_ward), as.dendrogram(hc_diana))
From the analysis above, hierarchical clustering suggests that 4 clusters may be optimal. However, it may not be the best way to cluster these data, since the cluster sizes are quite uneven. Moreover, comparing the dendrograms generated by the agglomerative and divisive approaches, the lines between the two trees cross frequently, which implies that different hierarchical approaches can produce noticeably different results.
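The crossing seen in the tanglegram can also be quantified with dendextend::entanglement, which returns a value between 0 (the two trees align perfectly) and 1 (fully entangled). A small self-contained sketch on simulated data:

```r
library(dendextend)

set.seed(42)
x <- matrix(rnorm(60), ncol = 3)
# Two hierarchical trees built with different linkage methods
d_ward   <- as.dendrogram(hclust(dist(x), method = "ward.D2"))
d_single <- as.dendrogram(hclust(dist(x), method = "single"))
entanglement(d_ward, d_single)  # 0 = perfect alignment, 1 = fully entangled
```

The same call applied to as.dendrogram(hc_ward) and as.dendrogram(hc_diana) would put a number on the disagreement between the agglomerative and divisive trees shown above.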
data_z_no_duplicates <- unique(data_z) # Remove duplicate rows before DBSCAN (definition assumed; not shown above)
par(mfrow = c(2, 2))
db_eps_2 <- dbscan::kNNdistplot(data_z_no_duplicates, k = 2)
abline(h = 3.6, lty = 2)
db_eps_3 <- dbscan::kNNdistplot(data_z_no_duplicates, k = 3)
abline(h = 4, lty = 2)
db_eps_4 <- dbscan::kNNdistplot(data_z_no_duplicates, k = 4)
abline(h = 4, lty = 2)
db_eps_5 <- dbscan::kNNdistplot(data_z_no_duplicates, k = 5)
abline(h = 4.2, lty = 2)
par(mfrow = c(1, 1))
dbscan <- fpc::dbscan(data_z_no_duplicates, eps = 4, MinPts = 2)
print(dbscan)
## dbscan Pts=280 MinPts=2 eps=4
## 0 1
## border 3 0
## seed 0 277
## total 3 277
dbscan <- fpc::dbscan(data_z, eps = 4, MinPts = 5)
print(dbscan)
## dbscan Pts=292 MinPts=5 eps=4
## 0 1
## border 3 3
## seed 0 286
## total 3 289
dbscan <- fpc::dbscan(data_z_no_duplicates, eps = 2, MinPts = 3.3)
print(dbscan)
## dbscan Pts=280 MinPts=3.3 eps=2
## 0 1 2 3
## border 64 17 3 3
## seed 0 187 5 1
## total 64 204 8 4
In the cases above, regardless of whether k = 2, 3, 4, or 5, the plots all suggest an eps of around 4. However, when DBSCAN was run with eps = 4 and MinPts = 2, the result was a single cluster. With eps = 2 and MinPts = 3.3 (MinPts is conventionally an integer, though fpc accepts this value), DBSCAN suggests 3 clusters. The lower eps allows the algorithm to detect more meaningful local densities and separate the data into 3 groups. Therefore, after testing multiple values, eps = 2 is used to group the data by density.
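To clarify how eps and MinPts interact, here is an illustrative base-R sketch of how DBSCAN labels points (a simplification of what fpc::dbscan reports as seed/border/noise; the exact neighbour-counting convention is an assumption here): a point is a core point if it has at least MinPts neighbours within eps (counting itself), a border point if it lies within eps of a core point, and noise otherwise.

```r
# Label each point as core, border, or noise according to eps and MinPts
classify_points <- function(X, eps, MinPts) {
  D <- as.matrix(dist(X))
  n_neigh <- rowSums(D <= eps)          # neighbours within eps, incl. itself
  core <- n_neigh >= MinPts
  near_core <- apply(D[, core, drop = FALSE] <= eps, 1, any)
  ifelse(core, "core", ifelse(near_core, "border", "noise"))
}

set.seed(1)
X <- rbind(matrix(rnorm(40, sd = 0.3), ncol = 2),  # one dense blob (20 points)
           c(5, 5))                                # one isolated point
table(classify_points(X, eps = 1, MinPts = 4))
```

With a small eps, the isolated point cannot gather enough neighbours and is labelled noise, which mirrors how the 64 noise points arise in the run above.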
pca_dbscan <- prcomp(data_z_no_duplicates, scale. = TRUE)
cluster_dbscan <- data.frame(pca_dbscan$x[, 1:2], Cluster = as.factor(dbscan$cluster))
ggplot(cluster_dbscan, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 3, alpha = 0.8) +
theme_minimal() + theme(panel.grid = element_blank()) +
labs(title = "DBSCAN Clustering Visualized by PCA",
x = "Principal Component 1",
y = "Principal Component 2")
tsne_dbscan <- Rtsne(data_z_no_duplicates, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 1000)
colnames(tsne_dbscan$Y) <- c("tSNE1", "tSNE2")
tsne_data_dbscan <- data.frame(tsne_dbscan$Y, Cluster = as.factor(dbscan$cluster))
ggplot(tsne_data_dbscan, aes(x = tSNE1, y = tSNE2, color = Cluster)) +
geom_point(size = 3, alpha = 0.8) +
theme_minimal() + theme(panel.grid = element_blank()) +
labs(title = "DBSCAN Clustering Visualized by t-SNE", x = "t-SNE Dimension 1", y = "t-SNE Dimension 2")
count(data_z_no_duplicates[dbscan$cluster == 0, ]) # Extract noise points
## n
## 1 64
Noise points (outliers) are assigned to cluster 0, which indicates that they do not belong to any of the main clusters identified by the algorithm: they are not within eps distance of enough other points to form a dense region. The points in cluster 0 are therefore not part of any meaningful cluster (1, 2, or 3).
The PCA and t-SNE plots both show 3 clusters, with cluster 0 recognized as noise. In terms of cluster separation, the numbers of points in the different clusters are clearly uneven and not well divided. 64 data points are treated as noise and are not assigned to any cluster, and such a significant number of points might be important for the analysis. Because DBSCAN identifies clusters based on density, it may fail to form clusters in regions of lower point density and label those points as noise. Furthermore, DBSCAN cannot handle duplicate points, which had to be removed beforehand. There is therefore a risk of losing valuable data and of ending up with unrepresentative clusters. DBSCAN is thus not suitable here for carrying out clustering and providing critical insights. In the following section, centroid-based clustering algorithms are applied instead.
First and foremost, using the silhouette, elbow, and gap-statistic methods, the optimal numbers of clusters for K-means and PAM are suggested to be 4, 6, and 9.
kmeans_sil <- fviz_nbclust(data_z, FUNcluster = stats::kmeans, method = "silhouette")
pam_sil <- fviz_nbclust(data_z, FUNcluster = cluster::pam, method = "silhouette")
kmeans_wss <- fviz_nbclust(data_z, FUNcluster = stats::kmeans, method = "wss")
pam_wss <- fviz_nbclust(data_z, FUNcluster = cluster::pam, method = "wss")
kmeans_gap <- fviz_nbclust(data_z, FUNcluster = stats::kmeans, method = "gap_stat")
pam_gap <- fviz_nbclust(data_z, FUNcluster = cluster::pam, method = "gap_stat")
grid.arrange(kmeans_wss, pam_wss, kmeans_sil, pam_sil, kmeans_gap, pam_gap, ncol = 2, top="K-Means (Left) & PAM (Right)")
randIndex(cclust(data_z, 4, dist="euclidean"), cclust(data_z, 6, dist="euclidean"))
## ARI
## 0.4710059
randIndex(cclust(data_z, 4, dist="euclidean"), cclust(data_z, 9, dist="euclidean"))
## ARI
## 0.3337932
comPart(cclust(data_z, 4, dist="euclidean"), cclust(data_z, 6, dist="euclidean"))
## ARI RI J FM
## 0.7285970 0.8719814 0.7073604 0.8341834
comPart(cclust(data_z, 4, dist="euclidean"), cclust(data_z, 9, dist="euclidean"))
## ARI RI J FM
## 0.3965191 0.7463164 0.3611144 0.5851535
According to the Rand index, the 4-cluster and 6-cluster solutions are very similar (high ARI, RI, J, and FM), while the 4-cluster and 9-cluster solutions are less similar (a noticeable drop in ARI and the other metrics).
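For reference, the adjusted Rand index reported by flexclust::randIndex can be sketched in a few lines of base R from the contingency table of the two partitions:

```r
# Adjusted Rand index from the contingency table of two cluster labelings
ari <- function(a, b) {
  tab <- table(a, b)
  comb2 <- function(x) x * (x - 1) / 2          # "n choose 2"
  sum_ij <- sum(comb2(tab))                     # agreeing pairs per cell
  sum_a  <- sum(comb2(rowSums(tab)))
  sum_b  <- sum(comb2(colSums(tab)))
  expected <- sum_a * sum_b / comb2(sum(tab))   # chance-expected agreement
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

p1 <- c(1, 1, 2, 2, 3, 3)
ari(p1, p1)                     # identical partitions -> 1
ari(p1, c(1, 1, 1, 2, 2, 2))    # partial agreement -> between 0 and 1
```

An ARI of 1 means identical partitions and values near 0 mean chance-level agreement, so the 0.47 for 4 vs. 6 clusters indicates moderate similarity.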
create_fviz_silhouette <- function(sil_object, method, k) {
avg_width <- summary(sil_object)$avg.width
fviz_silhouette(sil_object) +
ggtitle(paste(method, "(k =", k, ")")) +
labs(subtitle = paste("Avg Silhouette Width:", round(avg_width, 3)))
}
p1 <- create_fviz_silhouette(silhouette(kmeans(data_z, 4)$cluster, dist(data_z)), "K-Means", 4)
p2 <- create_fviz_silhouette(silhouette(pam(data_z, 4)$cluster, dist(data_z)), "PAM", 4)
p3 <- create_fviz_silhouette(silhouette(kmeans(data_z, 6)$cluster, dist(data_z)), "K-Means", 6)
p4 <- create_fviz_silhouette(silhouette(pam(data_z, 6)$cluster, dist(data_z)), "PAM", 6)
p5 <- create_fviz_silhouette(silhouette(kmeans(data_z, 9)$cluster, dist(data_z)), "K-Means", 9)
p6 <- create_fviz_silhouette(silhouette(pam(data_z, 9)$cluster, dist(data_z)), "PAM", 9)
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)
kmeans_shadow <- cclust(data_z, 4, dist = "euclidean") # Shadow statistics for K-Means
shadow(kmeans_shadow)
## 1 2 3 4
## 0.8255248 0.4926047 0.7848376 0.6964742
plot(shadow(kmeans_shadow))
While 4 clusters is consistently suggested by the elbow and silhouette methods, the gap statistic suggests 6 and 9 clusters for K-means and PAM respectively. The average silhouette widths are therefore crucial for comparing the quality of the candidate solutions (4, 6, and 9 clusters) under K-means and PAM, and applying K-means with 4 clusters usually gives the better result. As for the shadow values, note that in flexclust a value close to 0 means a point lies near its own centroid while a value close to 1 means it lies between two centroids; the fairly high cluster averages here (e.g., 0.83 for cluster 1) suggest that the 4 clusters are reasonably defined but partly overlapping.
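The shadow value itself has a simple closed form: in flexclust it is s(x) = 2 d(x, c1) / (d(x, c1) + d(x, c2)), where c1 and c2 are the nearest and second-nearest centroids. A minimal sketch with two hypothetical centroids:

```r
# Shadow value of a single point given a matrix of centroids (one per row)
shadow_value <- function(x, centroids) {
  d <- sqrt(colSums((t(centroids) - x)^2))  # distance to each centroid
  d_sorted <- sort(d)
  2 * d_sorted[1] / (d_sorted[1] + d_sorted[2])
}

centroids <- rbind(c(0, 0), c(10, 0))       # two hypothetical centroids
shadow_value(c(1, 0), centroids)  # near its centroid -> 0.2
shadow_value(c(5, 0), centroids)  # equidistant between centroids -> 1
```

Points deep inside a cluster thus get small shadow values, while points caught between two clusters approach 1, which is the basis for the interpretation of the cluster-wise averages above.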
pca_kmeans <- prcomp(data_z, scale = TRUE)
kmeans <- kmeans(data_z, 4)
pca_with_clusters <- data.frame(pca_kmeans$x[, 1:2], Cluster = as.factor(kmeans$cluster))
ggplot(pca_with_clusters, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(size = 3, alpha = 0.8) +
labs(title = "K-Means Clustering Visualized by PCA", x = "Principal Component 1", y = "Principal Component 2") +
theme_minimal() + theme(panel.grid = element_blank())
From the PCA visualization of K-means, the data points are more evenly distributed across clusters than with DBSCAN. However, there is significant overlap between clusters, making them hard to distinguish visually. Some clusters are also quite spread out, which indicates that PCA may not be capturing non-linear relationships in the data.
tsne_result <- Rtsne(data_z, dims = 2, perplexity = 30, check_duplicates = FALSE)
tsne_data_kmeans <- data.frame(tsne_result$Y)
colnames(tsne_data_kmeans) <- c("TSNE1", "TSNE2")
tsne_kmeans <- kmeans(tsne_data_kmeans, 4)
tsne_data_kmeans$cluster <- factor(tsne_kmeans$cluster)
ggplot(tsne_data_kmeans, aes(x = TSNE1, y = TSNE2, color = cluster)) +
geom_point(size = 3, alpha = 0.8) +
labs(title = "K-Means Clustering Visualized by t-SNE", x = "t-SNE Dimension 1", y = "t-SNE Dimension 2") +
theme_minimal() + theme(panel.grid = element_blank())
index.DB(pca_kmeans$x[, 1:2], kmeans$cluster)$DB
## [1] 0.8000018
index.DB(tsne_result$Y, tsne_kmeans$cluster)$DB
## [1] 0.7845071
From the t-SNE visualization of K-means, the clusters are more distinct and less overlapping, which makes them easier to interpret visually. t-SNE focuses on local relationships and helps preserve local cluster structure; the points within clusters are more concentrated, and it captures the underlying structure better, especially for non-linear separations. Comparing the Davies-Bouldin index of the PCA and t-SNE embeddings (lower is better), t-SNE provides better compactness and separation. Thus, in terms of cluster separation and clarity, t-SNE distinguishes the clusters visually, and the reduced overlap and greater concentration of points highlight distinct groupings, which is beneficial for further exploratory analysis. Now, let's add the players' names to the clusters to see how they are grouped and who they are.
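The Davies-Bouldin index used above (via clusterSim::index.DB) can be sketched in base R to show what is actually compared: for each cluster, the worst-case ratio of within-cluster scatter to between-centroid distance, averaged over clusters (a sketch assuming Euclidean distance and centroid-based scatter; clusterSim's defaults may differ in detail).

```r
# Davies-Bouldin index: mean over clusters of max_j (S_i + S_j) / d(c_i, c_j)
db_index <- function(X, cluster) {
  X <- as.matrix(X)
  ks <- sort(unique(cluster)); K <- length(ks)
  centroids <- t(sapply(ks, function(k) colMeans(X[cluster == k, , drop = FALSE])))
  scatter <- sapply(seq_len(K), function(i) {      # mean distance to own centroid
    Xi <- X[cluster == ks[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(Xi, 2, centroids[i, ])^2)))
  })
  mean(sapply(seq_len(K), function(i) {
    max(sapply(setdiff(seq_len(K), i), function(j) {
      (scatter[i] + scatter[j]) / sqrt(sum((centroids[i, ] - centroids[j, ])^2))
    }))
  }))
}

set.seed(1)
X <- rbind(matrix(rnorm(40, 0, 0.5), ncol = 2),   # two compact, well-separated blobs
           matrix(rnorm(40, 5, 0.5), ncol = 2))
db_index(X, rep(1:2, each = 20))  # small value: compact and well separated
```

Because the index divides scatter by separation, lower values mean tighter, better-separated clusters, which is why the t-SNE embedding's 0.78 is preferred over PCA's 0.80.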
## Cluster 1: A. Lagumdzija, Alan, Amin, Anderson, Asparuhov, Atanasijevic, Bayram, Bednorz, Boladz, Bottolo, Bovolenta, Brand, Cebulj, Clevenot, Conte, Darlan, Defalco, Dimitrov, Ewert, Faure, Fornal, Gurbuz, Ishikawa, Ivovic, Jorna, Kai, Karlitzek, Kovacevic, Kujundzic, Kurek, Lavia, Leal, Leon, Lima B., Loeppky, Lopez, Louati, Luburic, Lucarelli, M. Lagumdzija, Maar, Mandiraci, Michieletto, Milad, Miyaura, Morteza, Nimir, Nishida, Otsuka, Palonsky, Patry, Peric, Poriya, Porro L., Ran, Reichert, Romano, Salehi, Schott, Semeniuk, Sliwka, Szwarc, T. Stern, Tatarov, Tillie, Tomita, Urnaut, Van Garderen, Vicentin, Yant
##
## Cluster 3: Adamczyk, Ahyi, Arshia, Arthur, Aslan, Bak, Barbast, Bardarov, Barnes, Bostan, Briggs, Brizard, Bruno, Burggraf, Camino, Caneschi, Charles, Christenson, Cortesia, Currie, D. Kolev, De Cecco, De Weijer, Dobrev, Eksi, Esfandiar, Falaschi, Fernando, Firlej, Fukatsu, Gaggini, Gallego, Gardini, Giannelli, Goralik, Gorguet, Herr, Isaacson, Jakubiszak, Janusz, Javad, Jovovic, Kampa, Kartev, Keemink, Klok, Kolev, Komenda, Krzic, Kunstmann, Lomacz, Ma'a, Marovt, Marshman, Matheus G., Meijs, Mujanovic, Nachev, Oya, Peter, Planinsic, Porro P., Rinaldi, Ropret, S. Nikolov, Sadati, Sanchez Pages, Sbertoli, Schnitzer, Sekita, Stankov, Taboada, Telkiyski, Thondike, Tille, Todorovic, Toman, Toniutti, Tuaniga, Vadi, Van Der Ent, Vincic, Walsh, Wetter, Wildman, Yenipazar, Z. Stern, Zimmermann
##
## Cluster 2: Adriano, Andringa, Antov, Arman, Aydin, Balaso, Bozhilov, Butryn, Carle, Champlin, Chinenyeze, Chinenyeze.1, Dagostino, Danani, Diez, Done, E. Shoji, Fromm, Garcia, Garkov, Gironi, Graven, Grebennikov, Hatipoglu, Hawryluk, Hazrat, Hoag, Hofer, Honorato, Jaeschke, John, Kaliberda, Kapur, Kessel, Koops, Kovacic, L. Bergmann, Laurenzano, Lui, Martinez Franchi, Mauricio, Mergarejo, Mozic, Muagututia, Nasri, Ngapeth, Ogawa, Palacios, Petrov, Popiwczak, Recine, Ristic, Roman Garcia, Russell, Subasi, Szalpuk, Szymura, Takanashi, Ter Horst, Thales, Valchinov, Yamamoto, Zatorski, Zenger
##
## Cluster 4: Alonso, Anzani, Averill, Bedirhan, Bieniek, Bohme, Bultor, Concepcion, Demyanenko, Ensing, Eshenko, Flavio, Galassi, Gasman, Grozdanov, Grozer, Gunes, Hanes, Holt, Huber, Isac, Jelveh, Jendryk, Jouffroy, Judson, Kaczmarek, Karyagin, Kentaro, Klos, Knigge, Kochanowski, Kozamernik, Krage, Krick, Krsmanovic, Kukartsev, Larry, Le Goff, Loser Bruno, Lucas, Maase, Masso, Masulovic N., Matic, McCarthy, Mohammad, Mosca, Moslehabadi, Nedeljkovic, Onodera, Pajenk, Parkinson, Petkov, Plak, Podrascanin, Poreba, Ramos, Russo, Sanguinetti, Savas, Seddik, Simon, Smith, Stalekar, Ter Maat, Van Berkel, Wassenaar Ketrzynski, Wiltenburg, Yamauchi, Zerba
From the grouped results above, we can see that the players are clustered into 4 groups based on their performance.
Next, we find the top 10 players in each cluster based on their total successful points across the 6 performance characteristics.
## Team Position total_sf_pt
## T. Stern SLO O 397
## Cebulj SLO OH 379
## Maar CAN OH 339
## Nimir NED O 339
## Yant CUB OH 318
## Loeppky CAN OH 310
## Clevenot FRA OH 284
## Urnaut SLO OH 279
## Van Garderen NED OH 275
## Lopez CUB OH 271
## Team Position total_sf_pt
## Kovacic SLO L 195
## Grebennikov FRA L 178
## Danani ARG L 176
## Thales BRA L 157
## Arman IRI L 156
## Roman Garcia CUB L 140
## Bozhilov BUL L 130
## Lui CAN L 127
## Yamamoto JPN L 127
## Andringa NED L 125
## Team Position total_sf_pt
## Ropret SLO S 465
## De Cecco ARG S 377
## Herr CAN S 376
## S. Nikolov BUL S 374
## Fernando BRA S 335
## Thondike CUB S 310
## Sekita JPN S 302
## Keemink NED S 295
## Vadi IRI S 280
## Christenson USA S 275
## Team Position total_sf_pt
## Kozamernik SLO MB 158
## Concepcion CUB MB 156
## Pajenk SLO MB 143
## Loser Bruno ARG MB 138
## Grozdanov BUL MB 119
## Maase GER MB 119
## Le Goff FRA MB 113
## Zerba ARG MB 109
## McCarthy CAN MB 108
## Simon CUB MB 108
K-means is used for the final clustering after comparing its outcomes with hierarchical clustering, DBSCAN, and PAM. From the lists above, and based on the observations made after K-means clustering, we can note the following. The players in the first category usually perform exceptionally in almost all aspects; they are the aces and main scorers of a team. Their positions are mostly opposite hitter (O) and outside hitter (OH), and their responsibilities include hitting, passing in serve receive, playing defense, and blocking. It is therefore understandable that they score highly on all features except setting. The players in the second category are mostly middle blockers (MB), the team's best blockers. They are usually not strong setters, so MBs get the fewest successful-set points but better attack points. The players in the third category are defensive and serve-receive specialists; in the data retrieved, they are all liberos (L). Liberos are not allowed to attack the ball or set it in front of the net, and they usually act as a team's back-up and support, so their successful-dig points are comparatively high. The players in the fourth category are mainly setters (S). The setter is the decision-maker of a team, in charge of leading the attack, and is responsible for setting the ball up for a hitter on the second contact; consequently, they have extremely low attack points compared to the other clusters. Since the identified clusters can be matched to real-world roles and are reasonably explainable, this is a strong indication of the validity and relevance of the clustering results.
A fun fact is that many of the top players in each cluster are from Slovenia (SLO); they reach high successful-point totals from several perspectives. This may imply that the overall strength of that team is relatively superior and that they can be strong opponents. Yet, according to the VNL men's national team ranking, Poland (POL) is first and Slovenia ranks fourth, and no player from Poland appears in the top-10 lists from the clustering results. There is likely additional information to consider when clustering: this study clustered only on the given, selected characteristics and may have ignored the impact of other important characteristics or factors on players. There may also be certain biases in the selection of data samples. For more comprehensive and accurate research results, a broader and more representative dataset would be needed.
Using K-means Clustering to Create Training Groups for Elite American Football Student-athletes Based on Game Demands - https://journals.aiac.org.au/index.php/IJKSS/article/view/6092
Clustering of football players based on performance data and aggregated clustering validity indexes - https://www.degruyter.com/document/doi/10.1515/jqas-2022-0037/html